@cloud-fan
Contributor

cloud-fan commented Dec 13, 2016

What changes were proposed in this pull request?

The current way of resolving InsertIntoTable and CreateTable is convoluted: sometimes we replace them with concrete implementation commands during analysis, and sometimes during the planning phase.

And the error checking logic is also a mess: we may put it in extended analyzer rules, or extended checking rules, or CheckAnalysis.

This PR simplifies the data source analysis:

  1. InsertIntoTable and CreateTable are always unresolved and need to be replaced by concrete implementation commands during analysis.
  2. The error checking logic is mainly in 2 rules: PreprocessTableCreation and PreprocessTableInsertion.
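
A minimal sketch of point 1, using a toy plan tree rather than Spark's real Catalyst classes. Every name below (`Plan`, `Relation`, `InsertPlaceholder`, `ConcreteInsert`, `resolveInserts`) is a hypothetical stand-in. The placeholder always reports `resolved = false`, so a plan can only pass analysis once a rule has swapped the placeholder for a concrete command.

```scala
// Toy model only: these are NOT Spark's classes, just stand-ins for the idea.
object ToyAnalysis {
  sealed trait Plan {
    def children: Seq[Plan]
    // By default a node is resolved when all of its children are resolved.
    def resolved: Boolean = children.forall(_.resolved)
  }

  // A leaf relation that is already resolved.
  case class Relation(name: String) extends Plan {
    def children: Seq[Plan] = Nil
  }

  // Placeholder for INSERT: never resolved, must be replaced during analysis.
  case class InsertPlaceholder(table: String, query: Plan) extends Plan {
    def children: Seq[Plan] = query :: Nil
    override def resolved: Boolean = false
  }

  // Concrete command that the (hypothetical) analysis rule produces.
  case class ConcreteInsert(table: String, query: Plan) extends Plan {
    def children: Seq[Plan] = query :: Nil
  }

  // Analysis rule: replace the placeholder once its query is resolved.
  def resolveInserts(plan: Plan): Plan = plan match {
    case InsertPlaceholder(t, q) if q.resolved => ConcreteInsert(t, q)
    case other => other
  }
}
```

In this sketch, a plan that still contains `InsertPlaceholder` after all rules have run is by construction unresolved, which is what lets a single generic check catch any case the rules missed.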

How was this patch tested?

Existing tests.

SparkQA commented Dec 13, 2016

Test build #70092 has finished for PR 16269 at commit c075191.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
      • case class CreateTable(
      • case class PreprocessTableCreation(conf: SQLConf) extends Rule[LogicalPlan]
      • class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan]

cloud-fan changed the title from "[WIP] simplify data source analysis" to "[SPARK-19080][SQL] simplify data source analysis" Jan 4, 2017

@cloud-fan
Contributor Author

cc @yhuai @gatorsmile

SparkQA commented Jan 4, 2017

Test build #70878 has finished for PR 16269 at commit e94b187.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class CreateTableCommand(
      • case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]
      • class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan]

SparkQA commented Jan 4, 2017

Test build #70880 has finished for PR 16269 at commit 483c31f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class CreateTableCommand(
      • case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]
      • class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan]

SparkQA commented Jan 4, 2017

Test build #70882 has finished for PR 16269 at commit 6810dcf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class CreateTableCommand(
      • case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]
      • class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan]

SparkQA commented Jan 4, 2017

Test build #70884 has finished for PR 16269 at commit 87be209.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class CreateTableCommand(
      • case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]
      • class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan]

SparkQA commented Jan 11, 2017

Test build #71188 has started for PR 16269 at commit 87be209.

SparkQA commented Jan 23, 2017

Test build #71834 has finished for PR 16269 at commit a73134e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class CreateTableCommand(
      • case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]

SparkQA commented Jan 23, 2017

Test build #71839 has finished for PR 16269 at commit 3336699.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class CreateTableCommand(
      • case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]

SparkQA commented Jan 23, 2017

Test build #71840 has finished for PR 16269 at commit cede3c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 23, 2017

Test build #71861 has finished for PR 16269 at commit 771a36b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 23, 2017

Test build #71862 has finished for PR 16269 at commit dcea3a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Will read it carefully tomorrow. Thanks for the great work!

SparkQA commented Jan 24, 2017

Test build #71908 has finished for PR 16269 at commit 4b68c16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 24, 2017

Test build #71932 has finished for PR 16269 at commit 48535aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class HiveFileFormat(fileSinkConf: FileSinkDesc)

SparkQA commented Feb 3, 2017

Test build #72306 has finished for PR 16269 at commit 55a1db3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    t.isInstanceOf[Range] ||
    t == OneRowRelation ||
    t.isInstanceOf[LocalRelation] =>
  failAnalysis(s"Inserting into an RDD-based table is not allowed.")

Member

To other reviewers: this check is moved to PreWriteCheck.

plan.foreachUp {
  case o if !o.resolved => failAnalysis(s"unresolved operator ${o.simpleString}")
  case _ =>
}

Member

This move is great, since extendedCheckRules can now output a better error message.
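
A rough sketch of why the ordering matters, assuming the generic "unresolved operator" check now runs after the extended check rules. The types, the `extendedChecks` list and the specific error message below are hypothetical and much simpler than Spark's `CheckAnalysis`. The only point illustrated is that a specific check, run first, reports its own message before the generic fallback can fire.

```scala
// Toy model only: hypothetical names, not Spark's CheckAnalysis.
object ToyChecks {
  case class Node(name: String, resolved: Boolean, children: Seq[Node] = Nil) {
    // Visit children first, then this node (bottom-up).
    def foreachUp(f: Node => Unit): Unit = {
      children.foreach(_.foreachUp(f))
      f(this)
    }
  }

  class AnalysisError(msg: String) extends Exception(msg)

  // A specific check that knows why a given node could not be resolved.
  // The message text is made up for this example.
  val extendedChecks: Seq[Node => Unit] = Seq(
    (plan: Node) => plan.foreachUp { n =>
      if (n.name == "InsertIntoTable" && !n.resolved) {
        throw new AnalysisError("cannot insert into this kind of table")
      }
    }
  )

  def check(plan: Node): Unit = {
    // Specific checks run first so that their messages win.
    extendedChecks.foreach(_(plan))
    // Generic fallback for anything that is still unresolved.
    plan.foreachUp { n =>
      if (!n.resolved) throw new AnalysisError(s"unresolved operator ${n.name}")
    }
  }
}
```

If the generic check ran first, the unresolved `InsertIntoTable` node in this toy would only ever produce the less helpful "unresolved operator" message.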

s"column(s) but the inserted data has " +
s"${query.output.size + numStaticPartitions} column(s), including " +
s"$numStaticPartitions partition column(s) having constant value(s).")
}

Member

To other reviewers: This is moved to PreprocessTableInsertion.

After the above two moves, def checkAnalysis(plan: LogicalPlan) no longer has any DDL/DML error handling, so it becomes cleaner.

/**
- * Insert some data into a table.
+ * Insert some data into a table. Note that this plan is unresolved and has to be replaced by the
+ * concrete implementations during analysis.

Member

gatorsmile commented Feb 3, 2017

To other reviewers: InsertIntoTable is a unified unresolved node for representing INSERT. Once resolution completes, it is replaced by InsertIntoDataSourceCommand, InsertIntoHadoopFsRelationCommand or InsertIntoHiveTable.

def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
  case CreateTable(tableDesc, mode, None) if DDLUtils.isHiveTable(tableDesc) =>
    val cmd = CreateTableCommand(tableDesc, ifNotExists = mode == SaveMode.Ignore)
    ExecutedCommandExec(cmd) :: Nil

Member

To other reviewers: this is moved to HiveAnalysis.

tableDesc,
mode,
query)
ExecutedCommandExec(cmd) :: Nil

Member

To other reviewers: the above two cases are moved to DataSourceAnalysis.

query)
ExecutedCommandExec(cmd) :: Nil

case c: CreateTempViewUsing => ExecutedCommandExec(c) :: Nil

Member

gatorsmile commented Feb 3, 2017

To other reviewers: since CreateTempViewUsing is a RunnableCommand, this line is unnecessary.


case InsertIntoTable(l @ LogicalRelation(t: InsertableRelation, _, _),
    part, query, overwrite, false) if part.isEmpty =>
  ExecutedCommandExec(InsertIntoDataSourceCommand(l, query, overwrite)) :: Nil

Member

To the other reviewers: this is moved to DataSourceAnalysis


/**
* Create a table and optionally insert some data into it. Note that this plan is unresolved and
* has to be replaced by the concrete implementations during analyse.

Member

To the other reviewers, CreateTable is a unified unresolved logical representation of CREATE TABLE (AS SELECT). It will be replaced by CreateHiveTableAsSelectCommand, CreateTableCommand, CreateDataSourceTableCommand or CreateDataSourceTableAsSelectCommand.
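
A toy sketch of the replacement described above. The four command names come from this comment, but the types are simplified stand-ins and the dispatch conditions are partly an assumption: only the Hive, no-query case mapping to `CreateTableCommand` is visible in the diff in this thread, and the other three mappings are inferred from the command names.

```scala
// Toy model only: command names mirror the comment above, but the types
// and the dispatch conditions are simplified assumptions.
object ToyCreateTable {
  sealed trait Command
  case object CreateTableCommand extends Command
  case object CreateHiveTableAsSelectCommand extends Command
  case object CreateDataSourceTableCommand extends Command
  case object CreateDataSourceTableAsSelectCommand extends Command

  // `query` is None for a plain CREATE TABLE and Some(...) for CTAS.
  case class CreateTable(isHiveTable: Boolean, query: Option[String])

  // Pick the concrete command from the table kind and the optional query.
  def resolve(ct: CreateTable): Command = ct match {
    case CreateTable(true, None)     => CreateTableCommand
    case CreateTable(true, Some(_))  => CreateHiveTableAsSelectCommand
    case CreateTable(false, None)    => CreateDataSourceTableCommand
    case CreateTable(false, Some(_)) => CreateDataSourceTableAsSelectCommand
  }
}
```

Because the match is exhaustive over both dimensions, every `CreateTable` in this sketch ends up as exactly one concrete command, which matches the goal of never leaving the unresolved node in an analyzed plan.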

// We don't want `table` in children as sometimes we don't want to transform it.
override def children: Seq[LogicalPlan] = query :: Nil
override def output: Seq[Attribute] = Seq.empty
override lazy val resolved: Boolean = false

Member

gatorsmile commented Feb 3, 2017

@cloud-fan After this change, we are unable to reach the check in checkForStreaming. BTW, it sounds like we do not have any test case to cover these scenarios.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala#L110-L115

Contributor Author

Now we resolve CreateTable and InsertIntoTable to concrete commands, so the check still works.


/**
* Create a table and optionally insert some data into it. Note that this plan is unresolved and
* has to be replaced by the concrete implementations during analyse.

Member

Nit: analyse -> analysis

 *
- * Note that, this rule must be run after `PreprocessTableInsertion`.
+ * Note that, this rule must be run after `PreprocessTableCreation` and
+ * `PreprocessTableInsertion`.

Member

Could you also add two more rules, OrcConversions and ParquetConversions? HiveAnalysis must run after them too.

 *
- * Note that, this rule must be run after [[PreprocessTableInsertion]].
+ * Note that, this rule must be run after `PreprocessTableCreation` and
+ * `PreprocessTableInsertion`.

Member

The same updates are needed here.

@gatorsmile
Member

gatorsmile commented Feb 3, 2017

LGTM except for a few minor comments. The major concern is our error checking for structured streaming. It sounds like the test case coverage in that area is weak.

@gatorsmile
Member

LGTM pending test

SparkQA commented Feb 6, 2017

Test build #72444 has finished for PR 16269 at commit 9742f78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

Thanks for the review, merging to master!

asfgit closed this in aff5302 Feb 6, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Author: Wenchen Fan <[email protected]>

Closes apache#16269 from cloud-fan/ddl.